{
 "cells": [
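  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pima Indians Diabetes Data Set: Feature Engineering\n",
    "\n",
    "Data description:  \n",
    "Pima Indians Diabetes Data Set   \n",
    "The task is to predict, from the available medical measurements, whether a Pima Indian patient will develop diabetes within 5 years.     \n",
    "  \n",
    "The data set has 9 fields:   \n",
    "Column 0: Pregnancies (number of pregnancies)  \n",
    "Column 1: Glucose (plasma glucose concentration 2 hours into an oral glucose tolerance test)  \n",
    "Column 2: BloodPressure (diastolic blood pressure, in mm Hg)  \n",
    "Column 3: SkinThickness (triceps skin fold thickness, in mm)  \n",
    "Column 4: Insulin (2-hour serum insulin, in mu U/ml)  \n",
    "Column 5: BMI (body mass index: weight in kg / (height in m)^2)  \n",
    "Column 6: DiabetesPedigreeFunction (diabetes pedigree function)  \n",
    "Column 7: Age (age in years)  \n",
    "Column 8: Outcome (class variable, 0 or 1)  \n",
    "   \n",
    "Data link: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes  \n",
    "  \n",
    "P.S.: Kaggle also hosts a Practice Fusion Diabetes Classification task that is worth a try :)  \n",
    "https://www.kaggle.com/c/pf2012-diabetes "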
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import the required packages for file reading and feature encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Path and name of the data file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pregnancies</th>\n",
       "      <th>Glucose</th>\n",
       "      <th>BloodPressure</th>\n",
       "      <th>SkinThickness</th>\n",
       "      <th>Insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>DiabetesPedigreeFunction</th>\n",
       "      <th>Age</th>\n",
       "      <th>Outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \\\n",
       "0            6      148             72             35        0  33.6   \n",
       "1            1       85             66             29        0  26.6   \n",
       "2            8      183             64              0        0  23.3   \n",
       "3            1       89             66             23       94  28.1   \n",
       "4            0      137             40             35      168  43.1   \n",
       "\n",
       "   DiabetesPedigreeFunction  Age  Outcome  \n",
       "0                     0.627   50        1  \n",
       "1                     0.351   31        0  \n",
       "2                     0.672   32        1  \n",
       "3                     0.167   21        0  \n",
       "4                     2.288   33        1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train = pd.read_csv(\"pima-indians-diabetes.csv\")\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 768 entries, 0 to 767\n",
      "Data columns (total 9 columns):\n",
      " #   Column                    Non-Null Count  Dtype  \n",
      "---  ------                    --------------  -----  \n",
      " 0   Pregnancies               768 non-null    int64  \n",
      " 1   Glucose                   768 non-null    int64  \n",
      " 2   BloodPressure             768 non-null    int64  \n",
      " 3   SkinThickness             768 non-null    int64  \n",
      " 4   Insulin                   768 non-null    int64  \n",
      " 5   BMI                       768 non-null    float64\n",
      " 6   DiabetesPedigreeFunction  768 non-null    float64\n",
      " 7   Age                       768 non-null    int64  \n",
      " 8   Outcome                   768 non-null    int64  \n",
      "dtypes: float64(2), int64(7)\n",
      "memory usage: 54.1 KB\n"
     ]
    }
   ],
   "source": [
    "# Inspect basic information about the data:\n",
    "# the dtype of each column, non-null counts, memory usage, etc.\n",
    "train.info()"
   ]
  },
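  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data set is known to contain missing values, which are encoded as 0 in some columns. That is why `info()` reports no nulls above. The definitions of the measurements and basic domain knowledge confirm this: for example, a value of 0 for body mass index or blood pressure is physiologically meaningless."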
   ]
  },
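  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how this can be checked: count the zeros in each column where 0 is not a plausible measurement. The small `demo` frame below is made-up illustrative data (not rows from the data set), and `zero_counts` is a hypothetical helper name; on the real data one would pass `train` instead.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Made-up illustrative rows, not taken from the real data set\n",
    "demo = pd.DataFrame({\n",
    "    'Glucose': [148, 0, 183],\n",
    "    'BloodPressure': [72, 66, 0],\n",
    "    'BMI': [33.6, 0.0, 23.3],\n",
    "})\n",
    "\n",
    "def zero_counts(df, columns):\n",
    "    # Number of entries equal to 0 in each of the given columns\n",
    "    return (df[columns] == 0).sum()\n",
    "\n",
    "print(zero_counts(demo, ['Glucose', 'BloodPressure', 'BMI']))\n",
    "```\n",
    "\n",
    "Columns with a nonzero count are candidates for being re-encoded as missing."
   ]
  },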
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pregnancies</th>\n",
       "      <th>Glucose</th>\n",
       "      <th>BloodPressure</th>\n",
       "      <th>SkinThickness</th>\n",
       "      <th>Insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>DiabetesPedigreeFunction</th>\n",
       "      <th>Age</th>\n",
       "      <th>Outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>3.845052</td>\n",
       "      <td>120.894531</td>\n",
       "      <td>69.105469</td>\n",
       "      <td>20.536458</td>\n",
       "      <td>79.799479</td>\n",
       "      <td>31.992578</td>\n",
       "      <td>0.471876</td>\n",
       "      <td>33.240885</td>\n",
       "      <td>0.348958</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>3.369578</td>\n",
       "      <td>31.972618</td>\n",
       "      <td>19.355807</td>\n",
       "      <td>15.952218</td>\n",
       "      <td>115.244002</td>\n",
       "      <td>7.884160</td>\n",
       "      <td>0.331329</td>\n",
       "      <td>11.760232</td>\n",
       "      <td>0.476951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.078000</td>\n",
       "      <td>21.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>99.000000</td>\n",
       "      <td>62.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>27.300000</td>\n",
       "      <td>0.243750</td>\n",
       "      <td>24.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>117.000000</td>\n",
       "      <td>72.000000</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>30.500000</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>0.372500</td>\n",
       "      <td>29.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.000000</td>\n",
       "      <td>140.250000</td>\n",
       "      <td>80.000000</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>127.250000</td>\n",
       "      <td>36.600000</td>\n",
       "      <td>0.626250</td>\n",
       "      <td>41.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>17.000000</td>\n",
       "      <td>199.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>99.000000</td>\n",
       "      <td>846.000000</td>\n",
       "      <td>67.100000</td>\n",
       "      <td>2.420000</td>\n",
       "      <td>81.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \\\n",
       "count   768.000000  768.000000     768.000000     768.000000  768.000000   \n",
       "mean      3.845052  120.894531      69.105469      20.536458   79.799479   \n",
       "std       3.369578   31.972618      19.355807      15.952218  115.244002   \n",
       "min       0.000000    0.000000       0.000000       0.000000    0.000000   \n",
       "25%       1.000000   99.000000      62.000000       0.000000    0.000000   \n",
       "50%       3.000000  117.000000      72.000000      23.000000   30.500000   \n",
       "75%       6.000000  140.250000      80.000000      32.000000  127.250000   \n",
       "max      17.000000  199.000000     122.000000      99.000000  846.000000   \n",
       "\n",
       "              BMI  DiabetesPedigreeFunction         Age     Outcome  \n",
       "count  768.000000                768.000000  768.000000  768.000000  \n",
       "mean    31.992578                  0.471876   33.240885    0.348958  \n",
       "std      7.884160                  0.331329   11.760232    0.476951  \n",
       "min      0.000000                  0.078000   21.000000    0.000000  \n",
       "25%     27.300000                  0.243750   24.000000    0.000000  \n",
       "50%     32.000000                  0.372500   29.000000    0.000000  \n",
       "75%     36.600000                  0.626250   41.000000    1.000000  \n",
       "max     67.100000                  2.420000   81.000000    1.000000  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Summary statistics of the numeric features\n",
    "train.describe()"
   ]
  },
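  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output shows that many columns have a minimum of 0. In some of these columns a value of 0 is meaningless and indicates an invalid or missing value.  \n",
    "  \n",
    "Specifically, a 0 is meaningless for the following variables:  \n",
    "1. Plasma glucose concentration (Glucose)  \n",
    "2. Diastolic blood pressure (BloodPressure)  \n",
    "3. Triceps skin fold thickness (SkinThickness)  \n",
    "4. 2-hour serum insulin (Insulin)  \n",
    "5. Body mass index (BMI)  \n",
    "\n",
    "In a pandas DataFrame, the `replace()` method makes it easy to mark such values as NaN in the columns of interest.\n",
    "Once the missing values are marked, `isnull()` flags every NaN as True, and summing the result gives the number of missing values in each column."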
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pregnancies                   0\n",
      "Glucose                       5\n",
      "BloodPressure                35\n",
      "SkinThickness               227\n",
      "Insulin                     374\n",
      "BMI                          11\n",
      "DiabetesPedigreeFunction      0\n",
      "Age                           0\n",
      "Outcome                       0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# Columns in which a 0 is not a valid measurement and stands for a missing value\n",
    "NaN_col_names = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']\n",
    "train[NaN_col_names] = train[NaN_col_names].replace(0, np.nan)\n",
    "print(train.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## For features with many missing values, add an indicator feature that marks whether the value is missing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>SkinThickness</th>\n",
       "      <th>SkinThickness_Missing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>29.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>32.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>45.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   SkinThickness  SkinThickness_Missing\n",
       "0           35.0                      0\n",
       "1           29.0                      0\n",
       "2            NaN                      1\n",
       "3           23.0                      0\n",
       "4           35.0                      0\n",
       "5            NaN                      1\n",
       "6           32.0                      0\n",
       "7            NaN                      1\n",
       "8           45.0                      0\n",
       "9            NaN                      1"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Many values are missing, so add a new column that flags whether each value is missing (1) or present (0)\n",
    "train['SkinThickness_Missing'] = train['SkinThickness'].isnull().astype(int)\n",
    "train[['SkinThickness','SkinThickness_Missing']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x186029b13c8>"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAY1ElEQVR4nO3de5RU5Z3u8e8jom0O3hD0II0BFBNlcRE7ipIVLySjcjRoRnPgGJVLTmc8OiGJxjCTOSbxRI8zJrI0jq7FHBVxORDiZWAcdVQIunSMpDEtV41MVCglctGgBhVpf+ePentbYnVTQO+qhn4+a9Wq2u9+371/5WL14977rb0VEZiZmQHsVesCzMys83AomJlZxqFgZmYZh4KZmWUcCmZmltm71gXsil69ekX//v1rXYaZ2W5l8eLFGyKid7l1u3Uo9O/fn6amplqXYWa2W5H0alvrfPrIzMwyDgUzM8s4FMzMLLNbX1MwM+tIH374IYVCgffff7/WpXSIuro66uvr6d69e8VjHApmZkmhUGD//fenf//+SKp1ObskIti4cSOFQoEBAwZUPM6nj8zMkvfff59DDjlktw8EAEkccsghO3zU41AwMyuxJwRCq535Lg4FMzPLOBTMzNpRKBQYO3YsgwYN4sgjj2TKlCls2bKl3THXXXddlarreNqdH7LT0NAQu/qL5uO/P7ODqtn9Lb7h4lqXYFZTK1eu5JhjjsmWI4ITTzyRSy+9lIkTJ9LS0kJjYyM9e/bkhhtuaHM7PXr04N13361Gydu17XcCkLQ4IhrK9feRgplZGxYsWEBdXR0TJ04EoFu3bkybNo077riDW2+9lcsvvzzre/bZZ7Nw4UKmTp3Ke++9x/Dhw7nwwgsBmDlzJkOHDmXYsGFcdNFFALz66quMHj2aoUOHMnr0aFavXg3AhAkTuPTSSznttNMYOHAgTzzxBJMmTeKYY45hwoQJ2f4effRRTjrpJEaMGMEFF1zQYSHkUDAza8Py5cs5/vjjP9F2wAEHcMQRR7B169ayY66//nr2228/mpubueeee1i+fDnXXnstCxYs4Pnnn+emm24C4PLLL+fiiy9myZIlXHjhhXz729/OtvHWW2+xYMECpk2bxjnnnMN3v/tdli9fztKlS2lubmbDhg389Kc/5fHHH+e5556joaGBG2+8sUO+c26/U5BUBzwJ7Jv2c29E/EjSDOAUYFPqOiEimlW8TH4TMAbYnNqfy6s+M7PtiYiyM3jaai9nwYIFnH/++fTq1QuAnj17AvDMM89w//33A3DRRRdx1VVXZWPOOeccJDFkyBAOO+wwhgwZAsDgwYN55ZVXKBQKrFixglGjRgGwZcsWTjrppJ3/oiXy/PHaB8DpEfGupO7AU5IeTuu+HxH3btP/LGBQep0I3JbezcxqYvDgwdx3332faHv77bdZs2YNBx54IB999FHW3tbvASoNkNI+++67LwB77bVX9rl1eevWrXTr1o2vfOUrzJo1a4e+TyVyO30URa0nubqnV3tXtccCM9O43wAHSeqTV31mZtszevRoNm/ezMyZxQkpLS0tXHHFFUyYMIGBAwfS3NzMRx99xJo1a1i0aFE2rnv37nz44YfZNubMmcPGjRsBePPNNwE4+eSTmT17NgD33HMPX/ziFyuua+TIkTz99NOsWrUKgM2bN/P73/9+178wOV9TkNRNUjOwDngsIp5Nq66VtETSNEmtMdgXWFMyvJDatt1mo6QmSU3r16/Ps3wz6+Ik8cADD/CrX/2KQYMGcfTRR1NXV8d1113HqFGjGDBgAEOGDOHKK69kxIgR2bjGxkaGDh3KhRdeyODBg/nhD3/IKaecwrBhw/je974HwM0338ydd97J0KFDufvuu7NrDZXo3bs3M2bMYPz48QwdOpSRI0fywgsvdMx3rsaUVEkHAQ8Afw1sBP4I7ANMB/4zIq6R9G/A/42I
p9KY+cBVEbG4re16SmrH8pRU6+rKTd/c3XXKKakR8SdgIXBmRKxNp4g+AO4ETkjdCkC/kmH1wOvVqM/MzIpyCwVJvdMRApL2A74MvNB6nSDNNjoXWJaGzAMuVtFIYFNErM2rPjMz+7Q8Zx/1Ae6S1I1i+MyJiAclLZDUGxDQDPxV6v8QxemoqyhOSZ2YY21mZlZGbqEQEUuA48q0n95G/wAuy6seMzPbPv+i2czMMg4FMzPL+HGcZmY7oKOnsVc6FfyRRx5hypQptLS08M1vfpOpU6d2aB2tfKRgZtbJtbS0cNlll/Hwww+zYsUKZs2axYoVK3LZl0PBzKyTW7RoEUcddRQDBw5kn332Ydy4ccydOzeXfTkUzMw6uddee41+/T7+bW99fT2vvfZaLvtyKJiZdXLlbkdU6a27d5RDwcysk6uvr2fNmo/vF1ooFDj88MNz2ZdDwcysk/vCF77ASy+9xMsvv8yWLVuYPXs2X/3qV3PZl6ekmpntgFrcTXjvvffmlltu4YwzzqClpYVJkyYxePDgfPaVy1bNzKxDjRkzhjFjxuS+H58+MjOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwynpJqZrYDVl8zpEO3d8TVS7fbZ9KkSTz44IMceuihLFu2bLv9d4WPFMzMOrkJEybwyCOPVGVfDgUzs07uS1/6Ej179qzKvnILBUl1khZJel7Sckk/Se0DJD0r6SVJv5S0T2rfNy2vSuv751WbmZmVl+eRwgfA6RExDBgOnClpJPD3wLSIGAS8BUxO/ScDb0XEUcC01M/MzKoot1CIonfTYvf0CuB04N7Ufhdwbvo8Ni2T1o9WXjcMNzOzsnK9piCpm6RmYB3wGPCfwJ8iYmvqUgD6ps99gTUAaf0m4JAy22yU1CSpaf369XmWb2bW5eQ6JTUiWoDhkg4CHgCOKdctvZc7KvjU44YiYjowHaChoeHTjyMyM8tRJVNIO9r48eNZuHAhGzZsoL6+np/85CdMnjx5+wN3QlV+pxARf5K0EBgJHCRp73Q0UA+8nroVgH5AQdLewIHAm9Woz8ysM5s1a1bV9pXn7KPe6QgBSfsBXwZWAr8Gzk/dLgHmps/z0jJp/YIo92BSMzPLTZ5HCn2AuyR1oxg+cyLiQUkrgNmSfgr8Drg99b8duFvSKopHCONyrM3MzMrILRQiYglwXJn2PwAnlGl/H7ggr3rMzCoREewpEx935mSLf9FsZpbU1dWxcePGnfpj2tlEBBs3bqSurm6HxvmGeGZmSX19PYVCgT1luntdXR319fU7NMahYGaWdO/enQEDBtS6jJry6SMzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyuYWCpH6Sfi1ppaTlkqak9h9Lek1Sc3qNKRnzN5JWSXpR0hl51WZmZuXl+eS1rcAVEfGcpP2BxZIeS+umRcTPSjtLOhYYBwwGDgcel3R0RLTkWKOZmZXI7UghItZGxHPp8zvASqBvO0PGArMj4oOIeBlYBZyQV31mZvZpVbmmIKk/cBzwbGq6XNISSXdIOji19QXWlAwrUCZEJDVKapLUtKc8XNvMrLPIPRQk9QDuA74TEW8DtwFHAsOBtcDPW7uWGR6faoiYHhENEdHQu3fvnKo2M+uacg0FSd0pBsI9EXE/QES8EREtEfER8E98fIqoAPQrGV4PvJ5nfWZm9kl5zj4ScDuwMiJuLGnvU9LtPGBZ+jwPGCdpX0kDgEHAorzqMzOzT8tz9tEo4CJgqaTm1Pa3wHhJwymeGnoF+BZARCyXNAdYQXHm0mWeeWRmVl25hUJEPEX56wQPtTPmWuDavGoyM7P2+RfNZmaWcSiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVnGoWBmZhmHgpmZ
ZRwKZmaWcSiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVmmolCQNL+SNjMz2721+4xmSXXAZ4Bekg7m42cuHwAcnnNtZmZWZds7UvgWsBj4fHpvfc0F/rG9gZL6Sfq1pJWSlkuaktp7SnpM0kvp/eDULkk3S1olaYmkEbv65czMbMe0GwoRcVNEDACujIiBETEgvYZFxC3b2fZW4IqIOAYYCVwm6VhgKjA/IgYB89MywFnAoPRqBG7b+a9lZmY7o93TR60i4heSTgb6l46JiJntjFkLrE2f35G0EugLjAVOTd3uAhYCP0jtMyMigN9IOkhSn7QdMzOrgopCQdLdwJFAM9CSmgNoMxS2Gd8fOA54Fjis9Q99RKyVdGjq1hdYUzKskNo+EQqSGikeSXDEEUdUsnszM6tQRaEANADHpv+L3yGSegD3Ad+JiLcltdm1TNun9hcR04HpAA0NDTtcj5mZta3S3yksA/7rjm5cUneKgXBPRNyfmt+Q1Cet7wOsS+0FoF/J8Hrg9R3dp5mZ7bxKQ6EXsELSv0ua1/pqb4CKhwS3Aysj4saSVfOAS9LnSyjOZGptvzjNQhoJbPL1BDOz6qr09NGPd2Lbo4CLgKWSmlPb3wLXA3MkTQZWAxekdQ8BY4BVwGZg4k7s08zMdkGls4+e2NENR8RTlL9OADC6TP8ALtvR/ZiZWcepdPbRO3x80XcfoDvw54g4IK/CzMys+io9Uti/dFnSucAJuVRkZmY1s1N3SY2IfwFO7+BazMysxio9ffS1ksW9KP5uwb8RMDPbw1Q6++icks9bgVco3pbCzMz2IJVeU/D0UDOzLqDS00f1wC8o/vYggKeAKRFRyLE2q7LV1wypdQmdxhFXL611CWY1UemF5jsp/uL4cIo3qfvX1GZmZnuQSkOhd0TcGRFb02sG0DvHuszMrAYqDYUNkr4hqVt6fQPYmGdhZmZWfZWGwiTg68AfKT7f4Hx8byIzsz1OpVNS/w9wSUS8BcXnLAM/oxgWZma2h6j0SGFoayAARMSbFJ+kZmZme5BKQ2EvSQe3LqQjhUqPMszMbDdR6R/2nwP/Ieleir9T+DpwbW5VmZlZTVT6i+aZkpoo3gRPwNciYkWulZmZWdVVfAoohYCDwMxsD7ZTt842M7M9k0PBzMwyDgUzM8vkFgqS7pC0TtKykrYfS3pNUnN6jSlZ9zeSVkl6UdIZedVlZmZty/NIYQZwZpn2aRExPL0eApB0LDAOGJzG3CqpW461mZlZGbmFQkQ8CbxZYfexwOyI+CAiXgZWASfkVZuZmZVXi2sKl0takk4vtf5Kui+wpqRPIbV9iqRGSU2SmtavX593rWZmXUq1Q+E24EhgOMW7rf48tatM3yi3gYiYHhENEdHQu7cf6WBm1pGqGgoR8UZEtETER8A/8fEpogLQr6RrPfB6NWszM7Mqh4KkPiWL5wGtM5PmAeMk7StpADAIWFTN2szMLMc7nUqaBZwK9JJUAH4EnCppOMVTQ68A3wKIiOWS5lC8jcZW4LKIaMmrNjMzKy+3UIiI8WWab2+n/7X4zqtmZjXlZyKYdVLHf39mrUvoNBbfcHGtS+gyfJsLMzPLOBTMzCzjUDAzs4xDwczMMg4FMzPLOBTMzCzjUDAzs4xDwczMMg4FMzPLOBTMzCzjUDAzs4xDwczMMg4FMzPLOBTMzCzjUDAzs4xDwczMMg4FMzPLOBTMzCyTWyhIukPSOknLStp6SnpM0kvp/eDULkk3S1olaYmkEXnVZWZmbcvzSGEGcOY2bVOB+RExCJiflgHOAgalVyNwW451mZlZG3ILhYh4Enhzm+axwF3p813AuSXtM6PoN8BBkvrkVZuZmZVX7WsKh0XEWoD0fmhq7wusKelXSG1mZlZFneVCs8q0RdmOUqOkJklN69evz7ksM7Oupdqh8EbraaH0vi61F4B+Jf3qgdfLbSAipkdEQ0Q09O7dO9dizcy6mmqHwjzgkvT5EmBuSfvFaRbS
SGBT62kmMzOrnr3z2rCkWcCpQC9JBeBHwPXAHEmTgdXABan7Q8AYYBWwGZiYV11mZta23EIhIsa3sWp0mb4BXJZXLWZmVpnOcqHZzMw6gdyOFMzMOsrqa4bUuoRO44irl+a6fR8pmJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmaZmjyOU9IrwDtAC7A1Ihok9QR+CfQHXgG+HhFv1aI+M7OuqpZHCqdFxPCIaEjLU4H5ETEImJ+WzcysijrT6aOxwF3p813AuTWsxcysS6pVKATwqKTFkhpT22ERsRYgvR9abqCkRklNkprWr19fpXLNzLqGmlxTAEZFxOuSDgUek/RCpQMjYjowHaChoSHyKtDMrCuqyZFCRLye3tcBDwAnAG9I6gOQ3tfVojYzs66s6qEg6b9I2r/1M/AXwDJgHnBJ6nYJMLfatZmZdXW1OH10GPCApNb9/3NEPCLpt8AcSZOB1cAFNajNzKxLq3ooRMQfgGFl2jcCo6tdj5mZfawzTUk1M7MacyiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVnGoWBmZhmHgpmZZRwKZmaWcSiYmVnGoWBmZhmHgpmZZTpdKEg6U9KLklZJmlrreszMupJOFQqSugH/CJwFHAuMl3RsbasyM+s6OlUoACcAqyLiDxGxBZgNjK1xTWZmXcbetS5gG32BNSXLBeDE0g6SGoHGtPiupBerVNse77PQC9hQ6zo6hR+p1hVYCf/bLNEx/zY/29aKzhYK5b5tfGIhYjowvTrldC2SmiKiodZ1mG3L/zarp7OdPioA/UqW64HXa1SLmVmX09lC4bfAIEkDJO0DjAPm1bgmM7Muo1OdPoqIrZIuB/4d6AbcERHLa1xWV+LTctZZ+d9mlSgitt/LzMy6hM52+sjMzGrIoWBmZhmHgvnWItZpSbpD0jpJy2pdS1fhUOjifGsR6+RmAGfWuoiuxKFgvrWIdVoR8STwZq3r6EocClbu1iJ9a1SLmdWYQ8G2e2sRM+s6HArmW4uYWcahYL61iJllHApdXERsBVpvLbISmONbi1hnIWkW8AzwOUkFSZNrXdOezre5MDOzjI8UzMws41AwM7OMQ8HMzDIOBTMzyzgUzMws41AwM7OMQ8FqStIPJS2XtERSs6QTJb0iqVeZvv+xnW09kLaxStKm9LlZ0sntbPOr7d0uXFL/znDbZkmnSorSefqSjkttV6blayR9eQe32yDp5o6u13ZfneoZzda1SDoJOBsYEREfpD/a+7TVPyJObm97EXFe2u6pwJURcXbJvtoaM4/d5xfcS4H/DtyelscBz7eujIird3SDEdEENHVIdbZH8JGC1VIfYENEfAAQERsiIrvvkqT9JD0i6X+m5XfT+6mSFkq6V9ILku5RW3/1P+mvJT0naamkz6dtTZB0S/p8WDraeD69PhFCkgZK+p2kL6Rx96f6XpL0DyX9/kLSM2lfv5LUI7VfL2lFOir6WWq7QNKytL8nt1P/aqAu1SmKzxl4uGS/MySdvyP7Sv8tH0yff5wearNQ0h8kfbtk2/87/bd+TNKs1qMT2/P4SMFq6VHgakm/Bx4HfhkRT6R1PSg+22FmRMwsM/Y4YDDFm/c9DYwCntrO/jZExAhJ/wu4EvjmNutvBp6IiPPSw4d6AAcDSPpcqmdiRDRLGgwMT3V8ALwo6RfAe8DfAV+OiD9L+gHwvRQ85wGfj4iQdFDa59XAGRHxWklbe+4FLgB+BzyX9v0Jknruwr4+D5wG7J++023AMOAv03fdO+13cQW12m7IRwpWMxHxLnA80AisB34paUJaPRe4s41AAFgUEYWI+AhoBvpXsMv70/viNvqfDtyWamuJ
iE2pvXeq5xsR0VzSf35EbIqI94EVwGeBkRSfYPe0pGbgktT+NvA+8P8kfQ3YnLbxNDAjHQ11q+A7zKEYCuOBWW302ZV9/VtEfBARG4B1wGHAF4G5EfFeRLwD/GsFddpuyqFgNZX++C6MiB9RvDHfX6ZVTwNntXNaqPT/kFuo7Ki3dUyl/VttovggolEV1CDgsYgYnl7HRsTkdOPBE4D7gHOBRwAi4q8oHln0A5olHdJeIRHxR+BD4CvA/Db67Mq+2vpO1kU4FKxmJH1O0qCSpuHAq+nz1cBG4NYqljQfuDTV1k3SAal9C8U/rhdL+h/b2cZvgFGSjkrb+Yyko9N1hQMj4iHgOxS/K5KOjIhn00XiDXzy2RZtuRr4QUS0lFvZwfuC4mm5cyTVpW3/twrH2W7I1xSslnoAv0jnt7cCqyieSmqdNfQd4A5J/xARV1WhninA9DTts4ViQKwFSNcHzgYek/TntjYQEevTKbBZkvZNzX8HvAPMlVRH8f+8v5vW3ZCCURRD6Xm2IyLanZpL8XpApfs6pYL9/VbSvNT/VYqzlTa1P8p2V751tpltl6QeEfGupM8ATwKNEfFcreuyjucjBTOrxHRJxwJ1wF0OhD2XjxTMOhFJZwB/v03zy60/zDPLm0PBzMwynn1kZmYZh4KZmWUcCmZmlnEomJlZ5v8DY3BmHVAZmDwAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt  # plotting\n",
    "import seaborn as sns  # high-level interface built on matplotlib\n",
    "\n",
    "# Render figures inline, without calling show()\n",
    "%matplotlib inline\n",
    "sns.countplot(x=\"SkinThickness_Missing\", hue=\"Outcome\", data=train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x18603cc2148>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAUz0lEQVR4nO3df5BV5Z3n8fdXIOnsaoxI66qNAQzZKAsSbQ1KNtGhXBM2gklpVoqogBlmXKk4idksk2zl10bLKhMtXRNrzQYRC2HJqKvlukyMjEnFMWHQQuSHRioaaWQU0Ipx/IHgd//o049X7YYL9O3bdL9fVbfuOc99zjnfWwV8OM8557mRmUiSBHBQswuQJPUfhoIkqTAUJEmFoSBJKgwFSVIxtNkF7I8RI0bkqFGjml2GJB1QHnnkkW2Z2drdZwd0KIwaNYpVq1Y1uwxJOqBExB97+szhI0lSYShIkgpDQZJUHNDXFCSpN7355pt0dHTw+uuvN7uUXtHS0kJbWxvDhg2rextDQZIqHR0dHHLIIYwaNYqIaHY5+yUz2b59Ox0dHYwePbru7Rw+kqTK66+/zuGHH37ABwJARHD44Yfv9VmPoSBJNQZCIHTZl+9iKEiSCkNBknajo6OD6dOnM3bsWI477jguv/xyduzYsdttrrrqqj6qrvfFgfwjO+3t7bm/TzSf/F8W9VI1B75Hrrmo2SVITbVhwwaOP/74sp6ZfOITn+DSSy9l9uzZ7Nq1i7lz5zJ8+HCuueaaHvdz8MEH88orr/RFyXv07u8EEBGPZGZ7d/09U5CkHqxYsYKWlhZmz54NwJAhQ7juuutYsGABP/nJT5g3b17p+7nPfY4HH3yQ+fPn89prrzFx4kRmzpwJwKJFi5gwYQInnngiF154IQB//OMfmTJlChMmTGDKlCk8++yzAMyaNYtLL72UM888kzFjxvCrX/2KOXPmcPzxxzNr1qxyvF/84hecdtppnHTSSZx//vm9FkKGgiT1YN26dZx88snvaPvgBz/Isccey86dO7vd5uqrr+YDH/gAq1evZvHixaxbt44rr7ySFStW8Nhjj3H99dcDMG/ePC666CLWrFnDzJkz+cpXvlL28dJLL7FixQquu+46zjnnHL761a+ybt06Hn/8cVavXs22bdv4wQ9+wC9/+UseffRR2tvbufbaa3vlO/ucgiT1IDO7vYOnp/burFixgvPOO48RI0YAMHz4cAAefvhh7rzzTgAuvPBCvvGNb5RtzjnnHCKC8ePHc+SRRzJ+/HgAxo0bxzPPPENHRwfr169n8uTJAOzYsYPTTjtt379oDUNBknowbtw47rjjjne0vfzyy2zatIlDDz2Ut956q7T39DxAvQFS2+f9738/AAcddFBZ7lrfuXMnQ4YM4ayzzmLJkiV79X3q4fCRJPVgypQpvPrqqyxa1HlDyq5du7jiiiuYNWsWY8aMYfXq1bz11lts2rSJlStXlu2GDRvGm2++WfaxbNkytm/fDsCLL74IwOmnn87SpUsBWLx4MZ/85CfrrmvSpEk89NBDbNy4EYBXX32V3//+9/v/hTEUJKlHEcFdd93Fz3/+c8aOHctHP/pRWlpauOqqq5g8eTKjR49m/PjxfP3rX+ekk04q282dO5cJEyYwc+ZMxo0bx7e+9S0+/elPc+KJJ/K1r30NgBtuuIFbbrmFCRMmcNttt5VrDfVobW1l4cKFzJgxgwkTJjBp0iSeeOKJ3vnO3pLqLaldvCVVg113t28e6PrNLakRMTIi/iEiNkTEuoi4vGr/bkRsjojV1WtqzTZ/GxEbI+LJiDi7UbVJkrrXyAvNO4ErMvPRiDgEeCQi7q8+uy4zf1jbOSJOAC4AxgFHA7+MiI9m5q4G1ihJqtGwM4XM3JKZj1bLfwY2AMfsZpPpwNLMfCMznwY2Aqc2qj5J0nv1yYXmiBgFfBz4XdU0LyLWRMSCiDis
ajsG2FSzWQfdhEhEzI2IVRGxauvWrQ2sWpIGn4aHQkQcDNwB/E1mvgzcBBwHTAS2AD/q6trN5u+5Cp6ZN2dme2a2t7a2NqhqSRqcGhoKETGMzkBYnJl3AmTm85m5KzPfAn7K20NEHcDIms3bgOcaWZ8k6Z0adqE5Oh/P+xmwITOvrWk/KjO3VKufB9ZWy/cAt0fEtXReaB4LrESS+pHevo293lvBly9fzuWXX86uXbv48pe/zPz583u1ji6NvPtoMnAh8HhErK7avgnMiIiJdA4NPQP8FUBmrouIZcB6Ou9cusw7jySp80nqyy67jPvvv5+2tjZOOeUUpk2bxgknnNDrx2pYKGTmb+j+OsF9u9nmSuDKRtUkSQeilStX8pGPfIQxY8YAcMEFF3D33Xc3JBSc5kKS+rnNmzczcuTbl1zb2trYvHlzQ45lKEhSP9fddET1Tt29twwFSern2tra2LTp7ce4Ojo6OProoxtyLENBkvq5U045haeeeoqnn36aHTt2sHTpUqZNm9aQY/kjO5K0F5oxm/DQoUO58cYbOfvss9m1axdz5sxh3LhxjTlWQ/Yqab85rfvbnNYdpk6dytSpU/fccT85fCRJKgwFSVJhKEiSCkNBklQYCpKkwlCQJBXekipJe+HZ74/v1f0d++3H99hnzpw53HvvvRxxxBGsXbt2j/33h2cKktTPzZo1i+XLl/fJsQwFSernPvWpTzF8+PA+OZahIEkqDAVJUmEoSJIKQ0GSVHhLqiTthXpuIe1tM2bM4MEHH2Tbtm20tbXxve99j0suuaQhxzIUJKmfW7JkSZ8dy+EjSVJhKEiSCkNBkmpkZrNL6DX78l0MBUmqtLS0sH379gERDJnJ9u3baWlp2avtvNAsSZW2tjY6OjrYunVrs0vpFS0tLbS1te3VNoaCJFWGDRvG6NGjm11GUzl8JEkqDAVJUmEoSJKKhoVCRIyMiH+IiA0RsS4iLq/ah0fE/RHxVPV+WNUeEXFDRGyMiDURcVKjapMkda+RZwo7gSsy83hgEnBZRJwAzAceyMyxwAPVOsBngbHVay5wUwNrkyR1o2GhkJlbMvPRavnPwAbgGGA6cGvV7Vbg3Gp5OrAoO/0W+FBEHNWo+iRJ79Un1xQiYhTwceB3wJGZuQU6gwM4oup2DLCpZrOOqu3d+5obEasiYtVAuZdYkvqLhodCRBwM3AH8TWa+vLuu3bS957HCzLw5M9szs721tbW3ypQk0eBQiIhhdAbC4sy8s2p+vmtYqHp/oWrvAEbWbN4GPNfI+iRJ79TIu48C+BmwITOvrfnoHuDiavli4O6a9ouqu5AmAX/qGmaSJPWNRk5zMRm4EHg8IlZXbd8ErgaWRcQlwLPA+dVn9wFTgY3Aq8DsBtYmSepGw0IhM39D99cJAKZ00z+ByxpVjyRpz3yiWZJUGAqSpMJQkCQVhoIkqTAUJEmFoSBJKgwFSVJhKEiSCkNBklQYCpKkwlCQJBWGgiSpMBQkSYWhIEkqDAVJUmEoSJIKQ0GSVBgKkqTCUJAkFYaCJKkwFCRJhaEgSSoMBUlSYShIkgpDQZJUGAqSpMJQkCQVhoIkqTAUJEmFoSBJKhoWChGxICJeiIi1NW3fjYjNEbG6ek2t+exvI2JjRDwZEWc3qi5JUs/qCoWIeKCetndZCHymm/brMnNi9bqv2tcJwAXAuGqbn0TEkHpqkyT1nt2GQkS0RMRwYEREHBYRw6vXKODo3W2bmb8GXqyzjunA0sx8IzOfBjYCp9a5rSSpl+zpTOGvgEeAj1XvXa+7gR/v4zHnRcSaanjpsKrtGGBTTZ+Oqu09ImJuRKyKiFVbt27dxxIkSd3ZbShk5vWZORr4emaOyczR1evEzLxxH453E3AcMBHYAvyoao/uDt9DTTdnZntmtre2tu5DCZKkngytp1Nm/o+IOB0YVbtNZi7am4Nl5vNdyxHxU+DearUDGFnTtQ14bm/2LUnaf3WFQkTcRuf/8FcD
u6rmBPYqFCLiqMzcUq1+Hui6M+ke4PaIuJbOaxVjgZV7s29J0v6rKxSAduCEzOx2SKc7EbEEOIPOi9QdwHeAMyJiIp2B8gyd1yzIzHURsQxYD+wELsvMXd3tV5LUOPWGwlrg39B5HaAumTmjm+af7ab/lcCV9e5fktT76g2FEcD6iFgJvNHVmJnTGlKVJKkp6g2F7zayCElS/1Dv3Ue/anQhkqTmq/fuoz/z9nMD7wOGAf+SmR9sVGGSpL5X75nCIbXrEXEuTkMhqY88+/3xzS6h3zj22483dP/7NEtqZv4f4C96uRZJUpPVO3z0hZrVg+h8bqHuZxYkSQeGeu8+OqdmeSedD55N7/VqJElNVe81hdmNLkSS1Hz1/shOW0TcVf2S2vMRcUdEtDW6OElS36p3+OgW4Hbg/Gr9S1XbWY0oSs3hHR5va/QdHlJ/Ve/dR62ZeUtm7qxeCwF/zECSBph6Q2FbRHwpIoZUry8B2xtZmCSp79UbCnOALwL/TOdMqecBXnyWpAGm3msK/x24ODNfAoiI4cAP6QwLSdIAUe+ZwoSuQADIzBeBjzemJElSs9QbCgdFxGFdK9WZQr1nGZKkA0S9/7D/CPjHiPg7Oqe3+CL+SpokDTj1PtG8KCJW0TkJXgBfyMz1Da1MktTn6h4CqkLAIJCkAWyfps6WJA1MhoIkqTAUJEmFoSBJKgwFSVJhKEiSCkNBklQYCpKkwlCQJBWGgiSpaFgoRMSCiHghItbWtA2PiPsj4qnq/bCqPSLihojYGBFrIuKkRtUlSepZI88UFgKfeVfbfOCBzBwLPFCtA3wWGFu95gI3NbAuSVIPGhYKmflr4MV3NU8Hbq2WbwXOrWlflJ1+C3woIo5qVG2SpO719TWFIzNzC0D1fkTVfgywqaZfR9X2HhExNyJWRcSqrVu3NrRYSRps+suF5uimLbvrmJk3Z2Z7Zra3trY2uCxJGlz6OhSe7xoWqt5fqNo7gJE1/dqA5/q4Nkka9Po6FO4BLq6WLwburmm/qLoLaRLwp65hJklS36n7l9f2VkQsAc4ARkREB/Ad4GpgWURcAjwLnF91vw+YCmwEXgVmN6ouSVLPGhYKmTmjh4+mdNM3gcsaVYskqT795UKzJKkfMBQkSYWhIEkqDAVJUmEoSJIKQ0GSVBgKkqTCUJAkFYaCJKkwFCRJhaEgSSoMBUlSYShIkgpDQZJUGAqSpMJQkCQVhoIkqTAUJEmFoSBJKgwFSVJhKEiSCkNBklQYCpKkwlCQJBWGgiSpMBQkSYWhIEkqDAVJUmEoSJIKQ0GSVAxtxkEj4hngz8AuYGdmtkfEcOB/A6OAZ4AvZuZLzahPkgarZp4pnJmZEzOzvVqfDzyQmWOBB6p1SVIf6k/DR9OBW6vlW4Fzm1iLJA1KzQqFBH4REY9ExNyq7cjM3AJQvR/RpNokadBqyjUFYHJmPhcRRwD3R8QT9W5YhchcgGOPPbZR9UnSoNSUM4XMfK56fwG4CzgVeD4ijgKo3l/oYdubM7M9M9tbW1v7qmRJGhT6PBQi4l9HxCFdy8B/ANYC9wAXV90uBu7u69okabBrxvDRkcBdEdF1/Nszc3lE/BOwLCIuAZ4Fzm9CbZI0qPV5KGTmH4ATu2nfDkzp63okSW/rT7ekSpKazFCQJBWGgiSpMBQkSYWhIEkqDAVJUmEoSJIKQ0GSVBgKkqTCUJAkFYaCJKkwFCRJhaEgSSoMBUlSYShIkgpDQZJUGAqSpMJQkCQVhoIkqTAUJEmFoSBJKgwFSVJhKEiSCkNBklQYCpKkwlCQJBWGgiSpMBQkSYWhIEkqDAVJUmEoSJKKfhcKEfGZiHgyIjZGxPxm1yNJg0m/CoWIGAL8GPgscAIwIyJOaG5VkjR49KtQAE4FNmbmHzJzB7AUmN7kmiRp0Bja7ALe5RhgU816B/CJ2g4RMReYW62+EhFP9lFtA96HYQSwrdl19AvfiWZXoBr+2azRO382P9zTB/0tFLr7
tvmOlcybgZv7ppzBJSJWZWZ7s+uQ3s0/m32nvw0fdQAja9bbgOeaVIskDTr9LRT+CRgbEaMj4n3ABcA9Ta5JkgaNfjV8lJk7I2Ie8PfAEGBBZq5rclmDicNy6q/8s9lHIjP33EuSNCj0t+EjSVITGQqSpMJQkFOLqN+KiAUR8UJErG12LYOFoTDIObWI+rmFwGeaXcRgYijIqUXUb2Xmr4EXm13HYGIoqLupRY5pUi2SmsxQ0B6nFpE0eBgKcmoRSYWhIKcWkVQYCoNcZu4EuqYW2QAsc2oR9RcRsQR4GPi3EdEREZc0u6aBzmkuJEmFZwqSpMJQkCQVhoIkqTAUJEmFoSBJKgwFSVJhKGhAiIhXenl/o7qma46I9oi4YR/3kxFxW8360IjYGhH3VuvT9mW68oj4x32pR9qTfvUbzVJ/lJmrgFX7uPm/AP8uIj6Qma8BZwGba/Z9D/vwBHlmnr6P9Ui75ZmCBpSIOCMiHoyIv4uIJyJicURE9dnVEbE+ItZExA+rtoURcV7N9u8546j22fU/++9WP/zyYET8ISK+UkdZ/w/4j9XyDGBJzb5nRcSN1fL5EbE2Ih6LiF9XbeMiYmVErK7qHltb5x6+79Sq7TcRcUPXd5B2xzMFDUQfB8bRObHfQ8DkiFgPfB74WGZmRHxoP/b/MeBM4BDgyYi4KTPf3E3/pcC3q3+UJwALgH/fTb9vA2dn5uaa+v4auD4zF1dzUw3pZrvuvu8q4H8Cn8rMp6vpIqQ98kxBA9HKzOzIzLeA1cAo4GXgdeB/RcQXgFf3Y///NzPfyMxtwAvAkbvrnJlrqhpmAPftputDwMKI+Eve/sf/YeCbEfFfgQ9XQ1Dv1t33/Rjwh8x8uupjKKguhoIGojdqlncBQ6uJ/04F7gDOBZZXn++k+ntQDbu8b1/2X8c29wA/ZDf/OGfmXwP/jc6pzFdHxOGZeTswDXgN+PuI+Is66+nudzKkPXL4SINCRBwM/KvMvC8ifgtsrD56BjgZWEbnz5AOa1AJC4A/ZebjEXFGDzUel5m/A34XEecAIyPiUDr/x39DRIyhc/hpRR3HewIYExGjMvMZ4D/1yrfQgGcoaLA4BLg7Ilro/F/0V6v2n1btK4EH6LxbqNdlZgdw/R66XVNdSI6qlseA+cCXIuJN4J+B79d5vNci4j8DyyNiG7Byn4vXoOLU2dIAFREHZ+Yr1bDYj4GnMvO6Ztel/s1rCtLA9ZcRsRpYBxxK591I0m55piDtp4g4nM7hnnebkpnb+7oeaX8YCpKkwuEjSVJhKEiSCkNBklQYCpKk4v8DpA2+eK2TwkoAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Insulin has many missing values, so add an indicator column that flags whether each value is missing\n",
    "train['Insulin_Missing'] = train['Insulin'].isnull().astype(int)\n",
    "sns.countplot(x=\"Insulin_Missing\", hue=\"Outcome\", data=train)"
   ]
  },
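  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice, however, we do not know whether a feature being missing has any relationship to the classification outcome.  \n",
    "For now, assume the values are missing at random, drop the newly added indicator features, and fill the gaps directly with each column's median.  "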
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.drop([\"SkinThickness_Missing\", \"Insulin_Missing\"], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Pregnancies                 0\n",
      "Glucose                     0\n",
      "BloodPressure               0\n",
      "SkinThickness               0\n",
      "Insulin                     0\n",
      "BMI                         0\n",
      "DiabetesPedigreeFunction    0\n",
      "Age                         0\n",
      "Outcome                     0\n",
      "dtype: int64\n"
     ]
    }
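   ],
   "source": [
    "# Fill the remaining NaNs with each column's median\n",
    "medians = train.median()\n",
    "train = train.fillna(medians)\n",
    "\n",
    "# Confirm that no missing values remain\n",
    "print(train.isnull().sum())"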
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data standardization"
   ]
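  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Separate the inputs (feature columns) from the target (whether the patient has diabetes)\n",
    "y_train = train['Outcome']\n",
    "X_train = train.drop([\"Outcome\"], axis=1)\n",
    "\n",
    "# Keep the feature names so the processed data can be saved with its column headers\n",
    "feat_names = X_train.columns\n",
    "\n",
    "# Standardize the data\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Initialize the feature scaler\n",
    "ss_X = StandardScaler()\n",
    "\n",
    "# Fit the scaler on the training features and transform them (only training data is processed here)\n",
    "X_train = ss_X.fit_transform(X_train)"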
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving the processed features to a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
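   },
   "outputs": [],
   "source": [
    "# Save as CSV: wrap the scaled array back into a DataFrame with the original feature names\n",
    "X_train = pd.DataFrame(columns = feat_names, data = X_train)\n",
    "\n",
    "# Re-attach the labels; the rows align only if y_train still has the default 0..n-1 index\n",
    "train = pd.concat([X_train, y_train], axis = 1)\n",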
    "\n",
    "train.to_csv('FE-pima-indians-diabetes.csv', index = False, header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Pregnancies</th>\n",
       "      <th>Glucose</th>\n",
       "      <th>BloodPressure</th>\n",
       "      <th>SkinThickness</th>\n",
       "      <th>Insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>DiabetesPedigreeFunction</th>\n",
       "      <th>Age</th>\n",
       "      <th>Outcome</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.639947</td>\n",
       "      <td>0.866045</td>\n",
       "      <td>-0.031990</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>0.166619</td>\n",
       "      <td>0.468492</td>\n",
       "      <td>1.425995</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.205066</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-0.852200</td>\n",
       "      <td>-0.365061</td>\n",
       "      <td>-0.190672</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.233880</td>\n",
       "      <td>2.016662</td>\n",
       "      <td>-0.693761</td>\n",
       "      <td>-0.012301</td>\n",
       "      <td>-0.181541</td>\n",
       "      <td>-1.332500</td>\n",
       "      <td>0.604397</td>\n",
       "      <td>-0.105584</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.073567</td>\n",
       "      <td>-0.528319</td>\n",
       "      <td>-0.695245</td>\n",
       "      <td>-0.540642</td>\n",
       "      <td>-0.633881</td>\n",
       "      <td>-0.920763</td>\n",
       "      <td>-1.041549</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.141852</td>\n",
       "      <td>0.504422</td>\n",
       "      <td>-2.679076</td>\n",
       "      <td>0.670643</td>\n",
       "      <td>0.316566</td>\n",
       "      <td>1.549303</td>\n",
       "      <td>5.484909</td>\n",
       "      <td>-0.020496</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Pregnancies   Glucose  BloodPressure  SkinThickness   Insulin       BMI  \\\n",
       "0     0.639947  0.866045      -0.031990       0.670643 -0.181541  0.166619   \n",
       "1    -0.844885 -1.205066      -0.528319      -0.012301 -0.181541 -0.852200   \n",
       "2     1.233880  2.016662      -0.693761      -0.012301 -0.181541 -1.332500   \n",
       "3    -0.844885 -1.073567      -0.528319      -0.695245 -0.540642 -0.633881   \n",
       "4    -1.141852  0.504422      -2.679076       0.670643  0.316566  1.549303   \n",
       "\n",
       "   DiabetesPedigreeFunction       Age  Outcome  \n",
       "0                  0.468492  1.425995        1  \n",
       "1                 -0.365061 -0.190672        0  \n",
       "2                  0.604397 -0.105584        1  \n",
       "3                 -0.920763 -1.041549        0  \n",
       "4                  5.484909 -0.020496        1  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
